Categorical Feature Encoding Techniques for Improved Classifier Performance when Dealing with Imbalanced Data of Fraudulent Transactions

Authors

Abstract

Fraudulent transaction data tend to have several categorical features with high cardinality. Preprocessing becomes complicated when the categories in such features have no order or meaningful mapping to numerical values. Even though many encoding techniques exist, their impact on highly imbalanced massive data sets has not been thoroughly evaluated. Two datasets with an imbalance of lower than 1% of frauds have been used in our study. Six encoding methods were employed, which belong to either target-agnostic or target-based groups. The experimental procedure involved the use of several machine-learning techniques, such as ensemble learning, along with both linear and non-linear learning approaches. Our study emphasizes the significance of carefully selecting an appropriate encoding approach for imbalanced datasets and machine learning algorithms. Using target-based encoding techniques can enhance model performance significantly. Among the various encoding methods assessed, the James-Stein and Weight of Evidence (WOE) encoders were the most effective, whereas the CatBoost encoder may not be optimal for imbalanced datasets. Moreover, it is crucial to bear in mind the curse of dimensionality when employing encoding techniques like hashing and One-Hot encoding.
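
To make the idea of a target-based encoder concrete, the following is a minimal sketch of Weight of Evidence encoding for a single categorical column with a binary fraud label. It is not the implementation used in the study. The function names `fit_woe` and `transform_woe`, the additive `smoothing` parameter, the fallback value for unseen categories, and the sign convention (log of the fraud share over the legitimate share) are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def fit_woe(categories, labels, smoothing=0.5):
    """Learn a {category: WOE} mapping from training data.

    Assumes a binary target where 1 = fraud and 0 = legitimate.
    `smoothing` is a hypothetical additive constant that keeps the
    logarithm finite for categories seen with only one class.
    """
    fraud = Counter()
    legit = Counter()
    for cat, y in zip(categories, labels):
        if y == 1:
            fraud[cat] += 1
        else:
            legit[cat] += 1

    seen = set(fraud) | set(legit)
    k = len(seen)
    total_fraud = sum(fraud.values())
    total_legit = sum(legit.values())

    woe = {}
    for cat in seen:
        # Smoothed share of all frauds / all legitimate rows in this category.
        fraud_share = (fraud[cat] + smoothing) / (total_fraud + smoothing * k)
        legit_share = (legit[cat] + smoothing) / (total_legit + smoothing * k)
        woe[cat] = math.log(fraud_share / legit_share)
    return woe


def transform_woe(categories, woe, default=0.0):
    """Replace each category with its learned WOE value.

    Categories not seen during fitting map to `default` (an assumed
    neutral value, meaning no evidence either way).
    """
    return [woe.get(cat, default) for cat in categories]


# Toy data: merchant category per transaction and a fraud flag.
train_cats = ["A", "A", "A", "B", "B", "C"]
train_labels = [1, 0, 0, 0, 0, 1]

mapping = fit_woe(train_cats, train_labels)
encoded = transform_woe(["C", "A", "Z"], mapping)
```

The mapping should be fitted on training data only, since it is computed from the labels and would otherwise leak target information into the evaluation. Whatever the cardinality of the feature, this encoding yields a single numeric column, whereas One-Hot encoding yields one column per distinct category, which is where the dimensionality concern arises.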

Similar resources

Dealing with Imbalanced Data using Bayesian Techniques

In the present work, we deal with the significant problem of high imbalance in data in binary or multi-class classification problems. We study two different linguistic applications. The former determines whether a syntactic construction (environment) that co-occurs with a verb in a natural text corpus constitutes a subcategorization frame of the verb or not. The latter is called Name Entity Recognitio...

Improved Sampling Techniques for Learning an Imbalanced Data Set

This paper presents the performance of a classifier built using the stackingC algorithm in nine different data sets. Each data set is generated using a sampling technique applied on the original imbalanced data set. Five new sampling techniques are proposed in this paper (i.e., SMOTERandRep, Lax Random Oversampling, Lax Random Undersampling, Combined-Lax Random Oversampling Undersampling, and C...

Mining Imbalanced Data with Learning Classifier Systems

This chapter investigates the capabilities of XCS for mining imbalanced datasets. Initial experiments show that, for moderate and high class imbalances, XCS tends to evolve a large proportion of overgeneral classifiers. Theoretical analyses are developed, deriving an imbalance bound up to which XCS should be able to differentiate between accurate and overgeneral classifiers. Some relevant param...

Extending rule based classifiers for dealing with imbalanced data

Many real world applications involve learning from imbalanced data sets, i.e. data where the minority class of primary importance is under-represented in comparison to majority classes. The high imbalance is an important obstacle for many traditional machine learning algorithms as they are biased towards majority classes. It is desired to improve prediction of interesting, minority class exampl...

Improved Crisp and Fuzzy Clustering Techniques for Categorical Data

Clustering is a widely used technique in data mining applications for discovering patterns in underlying data. Most traditional clustering algorithms are limited in handling datasets that contain categorical attributes. However, datasets with categorical types of attributes are common in real-life data mining problems. For these data sets, no inherent distance measure, like the Euclidean distance...

Journal

Journal title: International Journal of Computers Communications & Control

Year: 2023

ISSN: 1841-9844, 1841-9836

DOI: https://doi.org/10.15837/ijccc.2023.3.5433